Extracting sentences containing a keyword from a PDF document

I want to extract the sentences that include the word "emission" from a PDF document. It is a sustainability report and therefore has varied layouts, such as tables and differently structured paragraphs. However, I don't seem to get the full sentences extracted.

    import pdfplumber
    import pandas as pd


    # Function: extract text and tables from a PDF
    def extract_text_and_tables(pdf_path):
        extracted_data = []
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                # Extract body text
                page_text = page.extract_text()

                # Extract tables
                tables = page.extract_tables()
                extracted_data.append({"text": page_text, "tables": tables})
        return extracted_data


    # Function: extract statements containing "emission"
    def extract_emissions_statements(data):
        emissions_statements = []

        for page_data in data:
            # Search text
            if page_data["text"]:
                for line in page_data["text"].splitlines():
                    if "emission" in line.lower():
                        emissions_statements.append({"type": "text", "content": line.strip()})

            # Search tables
            for table in page_data["tables"]:
                if table:  # If table exists
                    for row in table:
                        # Check if "emission" occurs in the table row
                        if any("emission" in str(cell).lower() for cell in row):
                            emissions_statements.append({"type": "table_row", "content": row})
        return emissions_statements


    # Function: format results
    def format_emissions_statements(statements):
        formatted = []
        for statement in statements:
            if statement["type"] == "text":
                formatted.append(statement["content"])
            elif statement["type"] == "table_row":
                formatted.append(" | ".join(str(cell) for cell in statement["content"]))
        return formatted


    # PDF file path
    pdf_path = "your_report.pdf"

    # Step 1: Extract PDF text and tables
    extracted_data = extract_text_and_tables(pdf_path)

    # Step 2: Extract statements with "emission"
    emissions_statements = extract_emissions_statements(extracted_data)

    # Step 3: Format statements
    formatted_statements = format_emissions_statements(emissions_statements)

    # Show results
    for idx, statement in enumerate(formatted_statements):
        print(f"Statement {idx+1}: {statement}")

    # Optional: save results to a CSV file
    df = pd.DataFrame({"Statements": formatted_statements})
    df.to_csv("emissions_statements.csv", index=False)

I tried the code above, but my output looks like this:

    Statement 13: Energy and Emissions:
    Statement 14: transition to Net Zero and compensation for residual emissions with
    Statement 15: Reducing the Energy and Emission Intensity per Rupee of Turnover.
    Statement 16: Emission Intensity (MtCO2e per Rupee of Turnover): Achieved a
    Statement 17: last year, the Emission intensity has reduced by 26%.

For example, in statements 14 and 17 it is evident that the full sentence wasn't extracted. How can I change this so that I get the whole sentence?
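One likely cause is that the code matches on `splitlines()`, and the lines returned by `page.extract_text()` follow the visual line breaks on the page, not sentence boundaries. The following is a minimal sketch of one possible workaround: rejoin the lines of a page into a single string, split that string into sentences with a simple regular expression, and then filter by keyword. The helper names `split_into_sentences` and `sentences_with_keyword` are hypothetical and not part of pdfplumber, and the sample text is made up for illustration (only the fragments from statements 14 and 17 come from the output above).

```python
import re


def split_into_sentences(page_text):
    """Rejoin visual lines, then split on sentence-ending punctuation.

    Hypothetical helper; a heuristic, not a full sentence tokenizer.
    """
    # Collapse newlines and repeated whitespace into single spaces so that
    # a sentence broken across several visual lines becomes one string.
    text = re.sub(r"\s+", " ", page_text).strip()
    # Split after ., ! or ? when the next token starts with an uppercase
    # letter, a digit, a quote, or an opening parenthesis.
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9\"'(])", text)
    return [s for s in sentences if s]


def sentences_with_keyword(page_text, keyword="emission"):
    """Return the sentences of page_text that contain keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [s for s in split_into_sentences(page_text) if keyword in s.lower()]


# Made-up sample imitating how extracted page text wraps mid-sentence.
sample = (
    "We are committed to the\n"
    "transition to Net Zero and compensation for residual emissions with\n"
    "certified offsets. Compared to\n"
    "last year, the Emission intensity has reduced by 26%. Water use was flat."
)

for sentence in sentences_with_keyword(sample):
    print(sentence)
```

In the code from the question, this would mean calling `sentences_with_keyword(page_data["text"])` in place of the `splitlines()` loop. The regex split is only a heuristic: abbreviations such as "No. 3" are split incorrectly, headings without closing punctuation (like "Energy and Emissions:") merge into the following sentence, and sentences that continue on the next page are still cut. A dedicated sentence tokenizer, for example NLTK's `sent_tokenize`, could be substituted for the regex step.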
